Abstract

Neural networks have been shown to
improve performance across a range of
natural-language tasks. However, designing
and training them can be complicated.
Frequently, researchers resort to repeated
experimentation to pick optimal settings.

In this paper, we address the issue of 
choosing the correct number of units in 
hidden layers. We introduce a method for 
automatically adjusting network size by 
pruning out hidden units through
$\ell_{\infty,1}$ and $\ell_{2,1}$
regularization. We apply this method
to language modeling and demonstrate its 
ability to correctly choose the number of 
hidden units while maintaining perplexity. 

We also include these models in a machine 
translation decoder and show that these 
smaller neural models maintain the
significant improvements of their unpruned
versions.

1 Introduction 

Neural networks have proven to be highly
effective at many tasks in natural language. For
example, neural language models and joint
language/translation models improve machine
translation quality significantly (Vaswani et al.,
2013; Devlin et al., 2014). However, neural
networks can be complicated to design and train
well. Many decisions need to be made, and
performance can be highly dependent on making
them correctly. Yet the optimal settings are
non-obvious and can be laborious to find, often
requiring an extensive grid search involving
numerous experiments.

In this paper, we focus on the choice of the
sizes of hidden layers. We introduce a method
for automatically pruning out hidden layer units,
by adding a sparsity-inducing regularizer that
encourages units to deactivate if not needed, so
that they can be removed from the network. Thus,
after training with more units than necessary, a
network is produced that has hidden layers
correctly sized, saving both time and memory when
actually putting the network to use.

Using a neural n-gram language model (Bengio
et al., 2003), we show that our novel
auto-sizing method learns models that are
smaller than models trained without the method,
while maintaining nearly the same perplexity. The
method has only a single hyperparameter to adjust
(as opposed to adjusting the sizes of each of the
hidden layers), and we find that the same setting
works consistently well across different training
data sizes, vocabulary sizes, and n-gram sizes. In
addition, we show that incorporating these models
into a machine translation decoder still results
in large BLEU point improvements. The result is
that fewer experiments are needed to obtain models
that perform well and are correctly sized.

2 Background 

Language models are often used in natural
language processing tasks involving generation of
text. For instance, in machine translation, the
language model helps to output fluent translations,
and in speech recognition, the language model 
helps to disambiguate among possible utterances. 

Current language models are usually n-gram 
models, which look at the previous $n - 1$ words
to predict the $n$th word in a sequence, based
on (smoothed) counts of n-grams collected from 
training data. These models are simple but very 
effective in improving the performance of natural 
language systems. 

However, n-gram models suffer from some
limitations, such as data sparsity and memory
usage. As an alternative, researchers have begun
exploring the use of neural networks for language
modeling. For modeling n-grams, the most common
approach is the feedforward network of Bengio et
al. (2003), shown in Figure 1.

Each node represents a unit or “neuron,” which
has a real-valued activation. The units are
organized into real-vector-valued layers. The
activations at each layer are computed as follows.
(We assume $n = 3$; the generalization is easy.)
The two preceding words, $w_1, w_2$, are mapped
into lower-dimensional word embeddings,

$$x^1 = A_{:,w_1}, \qquad x^2 = A_{:,w_2},$$

then passed through two hidden layers,

$$y = f(B^1 x^1 + B^2 x^2 + b)$$

$$z = f(Cy + c)$$

where $f$ is an elementwise nonlinear activation
(or transfer) function. Commonly used activation
functions are the hyperbolic tangent, logistic
function, and rectified linear units, to name a
few. Finally, the result is mapped via a softmax
to an output probability distribution,

$$P(w_n \mid w_1 \cdots w_{n-1}) \propto \exp([Dz + d]_{w_n}).$$

The parameters of the model are $A$, $B^1$, $B^2$,
$b$, $C$, $c$, $D$, and $d$, which are learned by
minimizing the negative log-likelihood of the
training data using stochastic gradient descent
(with gradients computed by backpropagation) or
variants.
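As an illustration of the computation just described, the following is a minimal sketch in plain Python, assuming $n = 3$ and rectified linear units for $f$. The helper names (`forward`, `random_matrix`), the tiny layer sizes, and the random initialization are illustrative assumptions and are not taken from the NPLM toolkit; for convenience the embeddings are stored as rows of `A`.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    """Multiply a row-major matrix M by a vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vecs):
    """Elementwise sum of several equal-length vectors."""
    return [sum(xs) for xs in zip(*vecs)]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(params, w1, w2):
    """Return P(w_3 | w1, w2) as a list indexed by word id."""
    A, B1, B2, b, C, c, D, d = params
    x1, x2 = A[w1], A[w2]                       # embedding lookup
    y = relu(add(matvec(B1, x1), matvec(B2, x2), b))
    z = relu(add(matvec(C, y), c))
    return softmax(add(matvec(D, z), d))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: vocabulary 10, embeddings 4, hidden layers 6 and 3.
rng = random.Random(0)
V, E, H1, H2 = 10, 4, 6, 3
params = (
    random_matrix(V, E, rng),                   # A
    random_matrix(H1, E, rng),                  # B1
    random_matrix(H1, E, rng),                  # B2
    [0.0] * H1,                                 # b
    random_matrix(H2, H1, rng), [0.0] * H2,     # C, c
    random_matrix(V, H2, rng), [0.0] * V,       # D, d
)
probs = forward(params, w1=2, w2=7)
```

The returned list is a proper distribution over the vocabulary, so it can be plugged directly into a negative log-likelihood.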

Vaswani et al. (2013) showed that this model, 
with some improvements, can be used effectively 
during decoding in machine translation. In this
paper, we use and extend their implementation.

3 Methods 

Our method is focused on the challenge of
choosing the number of units in the hidden layers
of a feed-forward neural network. The networks
used for different tasks require different numbers
of units, and the layers in a single network also
require different numbers of units. Choosing too
few units can impair the performance of the
network, and choosing too many units can lead to
overfitting. It can also slow down computations
with the network, which can be a major concern for
many applications such as integrating neural
language models into a machine translation decoder.

Our method starts out with a large number of
units in each layer and then jointly trains the
network while pruning out individual units when
possible. The goal is to end up with a trained
network that also has the optimal number of units
in each layer.

Figure 1: Neural probabilistic language model
(Bengio et al., 2003), adapted from Vaswani et al.
(2013).

We do this by adding a regularizer to the
objective function. For simplicity, consider a
single layer without bias, $y = f(Wx)$. Let $L(W)$
be the negative log-likelihood of the model.
Instead of minimizing $L(W)$ alone, we want to
minimize $L(W) + \lambda R(W)$, where $R(W)$ is a
convex regularizer. The $\ell_1$ norm,
$R(W) = \|W\|_1 = \sum_{i,j} |W_{ij}|$, is a common
choice for pushing parameters to zero, which can
be useful for preventing overfitting and reducing
model size. However, we are interested in reducing
not only the number of parameters but also the
number of units. To do this, we need a different
regularizer.

We assume activation functions that satisfy
$f(0) = 0$, such as the hyperbolic tangent or
rectified linear unit ($f(x) = \max\{0, x\}$).
Then, if we push the incoming weights of a unit
$y_i$ to zero, that is, $W_{ij} = 0$ for all $j$
(as well as the bias, if any: $b_i = 0$), then
$y_i = f(0) = 0$ is independent of the previous
layers and contributes nothing to subsequent
layers. So the unit can be removed without
affecting the network at all. Therefore, we need a
regularizer that pushes all the incoming
connection weights to a unit together towards zero.
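The removal argument above can be checked on a toy example. The sketch below assumes rectified linear units and a hypothetical helper `prune_units`; the weights are made up for illustration. It drops every unit whose incoming weights and bias are exactly zero, together with the matching columns of the next layer, and compares the outputs before and after.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, b, x):
    """Compute W x + b for a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def prune_units(W, b, W_next):
    """Drop units whose incoming weights and bias are all exactly zero."""
    keep = [i for i, (row, bi) in enumerate(zip(W, b))
            if bi != 0.0 or any(w != 0.0 for w in row)]
    W_small = [W[i] for i in keep]
    b_small = [b[i] for i in keep]
    # Remove the matching columns of the next layer's weight matrix.
    W_next_small = [[row[i] for i in keep] for row in W_next]
    return W_small, b_small, W_next_small

# Hypothetical layer with 3 units; unit 1 has been driven to zero.
W = [[0.5, -1.0], [0.0, 0.0], [2.0, 0.25]]
b = [0.1, 0.0, -0.3]
W_next = [[1.0, 7.0, -2.0], [0.5, -4.0, 3.0]]
b_next = [0.0, 0.2]
x = [1.5, 0.5]

full = affine(W_next, b_next, relu(affine(W, b, x)))
Ws, bs, Wns = prune_units(W, b, W_next)
pruned = affine(Wns, b_next, relu(affine(Ws, bs, x)))
```

Here `full` and `pruned` agree even though the outgoing weights of the removed unit (7.0 and -4.0) are large, because the unit's activation is $f(0) = 0$.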

Here, we experiment with two, the $\ell_{2,1}$
norm and the $\ell_{\infty,1}$ norm.¹ The
$\ell_{2,1}$ norm on a matrix $W$ is

$$R(W) = \sum_i \|W_{i,:}\|_2 = \sum_i \Bigl(\sum_j W_{ij}^2\Bigr)^{1/2}. \tag{1}$$

(If there are biases $b_i$, they should be
included as well.) This puts equal pressure on
each row, but within each row, the larger values
contribute more, and therefore there is more
pressure on larger values towards zero. The
$\ell_{\infty,1}$ norm is

$$R(W) = \sum_i \|W_{i,:}\|_\infty = \sum_i \max_j |W_{ij}|. \tag{2}$$

Again, this puts equal pressure on each row, but
within each row, only the maximum value (or
values) matters, and therefore the pressure
towards zero is entirely on the maximum value(s).

¹ In the notation $\ell_{p,q}$, the subscript $p$
corresponds to the norm over each group of
parameters, and $q$ corresponds to the norm over
the group norms. Contrary to more common usage, in
this paper, the groups are rows, not columns.

Figure 2: The (unsquared) $\ell_2$ norm and
$\ell_\infty$ norm both have sharp tips at the
origin that encourage sparsity.

Figure 2 visualizes the sparsity-inducing
behavior of the two regularizers on a single row. Both
have a sharp tip at the origin that encourages all 
the parameters in a row to become exactly zero. 
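For concreteness, the two mixed norms can be computed as in the short sketch below, which treats each row of a matrix as one group (the row-wise convention used in this paper). The function names and the example matrix are illustrative assumptions.

```python
import math

def l21_norm(W):
    """Sum over rows of the (unsquared) Euclidean norm of each row."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)

def linf1_norm(W):
    """Sum over rows of the largest absolute value in each row."""
    return sum(max((abs(w) for w in row), default=0.0) for row in W)

# Hypothetical 3x2 weight matrix whose middle row is already zero.
W = [[3.0, -4.0], [0.0, 0.0], [1.0, 2.0]]
print(l21_norm(W))    # 5 + 0 + sqrt(5)
print(linf1_norm(W))  # 4 + 0 + 2
```

A row that is entirely zero contributes nothing to either norm, which is why both regularizers reward switching off whole units rather than scattered individual weights.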

4 Optimization 

However, this also means that sparsity-inducing 
regularizers are not differentiable at zero,
making gradient-based optimization methods trickier
to apply. The methods we use are discussed in 
detail elsewhere (Duchi et al., 2008; Duchi and 
Singer, 2009); in this section, we include a short 
description of these methods for completeness. 

4.1 Proximal gradient method 

Most work on learning with regularizers,
including this work, can be thought of as
instances of the proximal gradient method (Parikh
and Boyd, 2014). Our objective function can be
split into two parts, a convex and differentiable
part ($L$) and a convex but non-differentiable
part ($\lambda R$). In proximal gradient descent,
we alternate between improving $L$ alone and
$\lambda R$ alone. Let $u$ be the parameter values
from the previous iteration. We compute new
parameter values $w$ using:

$$v \leftarrow u - \eta \nabla L(u) \tag{3}$$

$$w \leftarrow \arg\min_w \left( \frac{1}{2\eta} \|w - v\|^2 + \lambda R(w) \right) \tag{4}$$

and repeat until convergence. The first update is
just a standard gradient descent update on $L$;
the second is known as the proximal operator for
$\lambda R$ and in many cases has a closed-form
solution. In the rest of this section, we provide
some justification for this method, and in
Sections 4.2 and 4.3 we show how to compute the
proximal operator for the $\ell_2$ and
$\ell_\infty$ norms.
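The alternation between the two updates can be sketched as a short loop. The example below is only a toy under stated assumptions: the loss is $L(w) = \frac{1}{2}\|w - t\|^2$ for a made-up target $t$, and the regularizer is the plain $\ell_1$ norm, whose proximal operator (soft thresholding) is used here as a stand-in for the operators derived later in this section. The names `proximal_gradient` and `soft_threshold` are hypothetical.

```python
def proximal_gradient(grad_L, prox, w0, eta, lam, steps=100):
    """Alternate a gradient step on L with the proximal step for lam * R."""
    w = list(w0)
    for _ in range(steps):
        g = grad_L(w)
        v = [wi - eta * gi for wi, gi in zip(w, g)]  # gradient step on L
        w = prox(v, eta * lam)                       # proximal step on lam * R
    return w

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (stand-in regularizer)."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]

# Toy loss L(w) = 0.5 * ||w - target||^2, so grad L(w) = w - target.
target = [3.0, -0.2, 1.0]
grad_L = lambda w: [wi - ti for wi, ti in zip(w, target)]

w_star = proximal_gradient(grad_L, soft_threshold, [0.0, 0.0, 0.0],
                           eta=0.5, lam=0.5)
```

For this separable toy problem the minimizer is the target shrunk towards zero by $\lambda$ in each coordinate, so the small middle coordinate ends up exactly zero while the others settle near 2.5 and 0.5.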

We can think of the gradient descent update (3)
on $L$ as follows. Approximate $L$ around $u$ by
the tangent plane,

$$\bar{L}(v) = L(u) + \nabla L(u)(v - u) \tag{5}$$

and move $v$ to minimize $\bar{L}$, but don’t move
it too far from $u$; that is, minimize

$$F(v) = \frac{1}{2\eta} \|v - u\|^2 + \bar{L}(v).$$

Setting partial derivatives to zero, we get

$$\frac{\partial F}{\partial v} = \frac{1}{\eta}(v - u) + \nabla L(u) = 0$$

$$v = u - \eta \nabla L(u).$$

By a similar strategy, we can derive the second
step (4). Again we want to move $w$ to minimize
the objective function, but don’t want to move it
too far from $u$; that is, we want to minimize:

$$G(w) = \frac{1}{2\eta} \|w - u\|^2 + \bar{L}(w) + \lambda R(w).$$

Note that we have not approximated $R$ by a
tangent plane. We can simplify this by
substituting in (3). The first term becomes

$$\begin{aligned}
\frac{1}{2\eta} \|w - u\|^2 &= \frac{1}{2\eta} \|w - v - \eta \nabla L(u)\|^2 \\
&= \frac{1}{2\eta} \|w - v\|^2 - \nabla L(u)(w - v) + \frac{\eta}{2} \|\nabla L(u)\|^2
\end{aligned}$$

and the second term becomes

$$\begin{aligned}
\bar{L}(w) &= L(u) + \nabla L(u)(w - u) \\
&= L(u) + \nabla L(u)(w - v - \eta \nabla L(u)).
\end{aligned}$$

The $\nabla L(u)(w - v)$ terms cancel out, and we
can ignore terms not involving $w$, giving

$$G(w) = \frac{1}{2\eta} \|w - v\|^2 + \lambda R(w) + \text{const.}$$

which is minimized by the update (4). Thus, we
have split the optimization step into two easier
steps: first, do the update for $L$ (3), then do
the update for $\lambda R$ (4). The latter can
often be done exactly (without approximating $R$
by a tangent plane). We show next how to do this
for the $\ell_2$ and $\ell_\infty$ norms.


4.2 $\ell_2$ and $\ell_{2,1}$ regularization

Since the $\ell_{2,1}$ norm on matrices (1) is
separable into the $\ell_2$ norm of each row, we
can treat each row separately. Thus, for
simplicity, assume that we have a single row and
want to minimize

$$G'(w) = \frac{1}{2\eta} \|w - v\|^2 + \lambda \|w\| + \text{const.}$$

The minimum is either at $w = 0$ (the tip of
the cone) or where the partial derivatives are
zero (Figure 3):

$$\frac{\partial G'}{\partial w} = \frac{1}{\eta}(w - v) + \lambda \frac{w}{\|w\|} = 0.$$

Clearly, $w$ and $v$ must have the same direction
and differ only in magnitude, that is,
$w = \alpha \frac{v}{\|v\|}$. Substituting this
into the above equation, we get the solution

$$\alpha = \|v\| - \eta\lambda.$$

Therefore the update is

$$w = \alpha \frac{v}{\|v\|}, \qquad \alpha = \max(0, \|v\| - \eta\lambda).$$
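The closed-form update above amounts to shrinking each row's length by $\eta\lambda$ and zeroing the row if it is shorter than that. A minimal sketch follows; the function names `prox_l2` and `prox_l21` and the example values are assumptions made for illustration.

```python
import math

def prox_l2(v, eta, lam):
    """Shrink the vector v towards the origin by eta * lam in Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    alpha = max(0.0, norm - eta * lam)
    if alpha == 0.0:
        # The whole row lands on the tip of the cone (also covers norm == 0).
        return [0.0] * len(v)
    return [alpha * x / norm for x in v]

def prox_l21(V, eta, lam):
    """Apply the l2 proximal operator to every row independently."""
    return [prox_l2(row, eta, lam) for row in V]

# Hypothetical values: eta * lam = 1, so rows shorter than 1 are zeroed.
V = [[3.0, 4.0], [0.1, 0.1]]
W = prox_l21(V, eta=0.5, lam=2.0)
```

The first row has length 5 and is rescaled to length 4, while the second row has length below 1 and becomes exactly zero, which is what allows the corresponding unit to be pruned.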


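4.3 $\ell_\infty$ and $\ell_{\infty,1}$ regularization

As above, since the $\ell_{\infty,1}$ norm on
matrices (2) is separable into the $\ell_\infty$
norm of each row, we can treat each row
separately; thus, we want to minimize

$$G'(w) = \frac{1}{2\eta} \|w - v\|^2 + \lambda \max_j |x_j| + \text{const.},$$

where $x_j$ denotes the $j$th component of $w$.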



Figure 3: Examples of the two possible cases for
the $\ell_2$ gradient update ($\|w\| > 0$ and
$\|w\| = 0$). Point $v$ is drawn with a hollow
dot, and point $w$ is drawn with a solid dot.

Figure 4: The proximal operator for the
$\ell_\infty$ norm (with strength $\eta\lambda$)
decreases the maximal components until the total
decrease sums to $\eta\lambda$. Projection onto
the $\ell_1$-ball (of radius $\eta\lambda$)
decreases each component by an equal amount until
they sum to $\eta\lambda$.


Intuitively, the solution can be characterized as:
Decrease all of the maximal $|x_j|$ until the
total decrease reaches $\eta\lambda$ or all the
$x_j$ are zero. See Figure 4.

If we pre-sort the $|x_j|$ in nonincreasing order,
it’s easy to see how to compute this: for
$\rho = 1, \ldots, n$, see if there is a value
$\xi \le |x_\rho|$ such that decreasing all the
$|x_1|, \ldots, |x_\rho|$ to $\xi$ amounts to a
total decrease of $\eta\lambda$. The largest
$\rho$ for which this is possible gives the
correct solution.
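The pre-sorting procedure can be written down directly. The following is a minimal sketch under the assumption that `delta` stands for the total decrease $\eta\lambda$ and is nonnegative; the name `prox_linf_sorted` is hypothetical. It runs in $O(n \log n)$ time because of the sort.

```python
import math

def prox_linf_sorted(v, delta):
    """Cap the magnitudes of v at xi so that the total decrease equals delta."""
    mags = sorted((abs(x) for x in v), reverse=True)
    if sum(mags) <= delta:
        # Not enough mass to absorb the decrease: every component becomes zero.
        return [0.0] * len(v)
    xi, prefix = 0.0, 0.0
    for rho, a in enumerate(mags, start=1):
        prefix += a
        cand = (prefix - delta) / rho   # cap if the top rho values are lowered
        if cand <= a:
            xi = cand                   # keep the cap for the largest valid rho
    return [math.copysign(min(abs(x), xi), x) for x in v]

w = prox_linf_sorted([4.0, 1.0, -3.0], 2.0)
print(w)  # [2.5, 1.0, -2.5]
```

In the example, the two largest magnitudes (4 and 3) are both lowered to 2.5, for a total decrease of 1.5 + 0.5 = 2, while the smallest component is untouched.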

But this situation seems similar to another
optimization problem, projection onto the
$\ell_1$-ball, which Duchi et al. (2008) solve in
linear time without pre-sorting. In fact, the two
problems can be solved by nearly identical
algorithms, because they are convex conjugates of
each other (Duchi and Singer, 2009; Bach et al.,
2012). Intuitively, the $\ell_1$ projection of $v$
is exactly what is cut out by the $\ell_\infty$
proximal operator, and vice versa (Figure 4).
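This relationship can be made concrete: subtracting the $\ell_1$-ball projection of $v$ from $v$ itself yields the $\ell_\infty$ proximal step. The sketch below uses a simple sort-based projection rather than the linear-time variant; the function names are hypothetical, and `delta` is assumed to play the role of $\eta\lambda$.

```python
import math

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {u : ||u||_1 <= radius}, via sorting."""
    if radius <= 0.0:
        return [0.0] * len(v)
    mags = sorted((abs(x) for x in v), reverse=True)
    if sum(mags) <= radius:
        return list(v)                   # already inside the ball
    theta, prefix = 0.0, 0.0
    for j, a in enumerate(mags, start=1):
        prefix += a
        cand = (prefix - radius) / j
        if a > cand:
            theta = cand                 # threshold for the largest valid j
    # Every component is reduced in magnitude by theta, stopping at zero.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]

def prox_linf_via_projection(v, delta):
    """l-infinity proximal step computed as v minus its l1-ball projection."""
    p = project_l1_ball(v, delta)
    return [x - px for x, px in zip(v, p)]

v = [4.0, 1.0, -3.0]
p = project_l1_ball(v, 2.0)              # the part that is "cut out"
w = prox_linf_via_projection(v, 2.0)     # what remains after the cut
```

With these inputs the projection is `[1.5, 0.0, -0.5]`, whose absolute values sum to the radius 2, and the remainder `[2.5, 1.0, -2.5]` is the vector with its two largest magnitudes capped at 2.5.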

Duchi et al.’s algorithm modified for the present
problem is shown as Algorithm 1. It partitions the
$x_j$ about a pivot element (line 6) and tests
whether it and the elements to its left can be
decreased to a value $\xi$ such that the total
decrease is $\delta$ (line 8). If so, it
recursively searches the right side; if not, the
left side. At the conclusion of the algorithm,
$\rho$ is set to the largest value that passes the
test (line 13), and finally the new $x_j$ are
computed (line 16), which is the only difference
from Duchi et al.’s algorithm.

This algorithm is asymptotically faster than that
of Quattoni et al. (2009). They reformulate
$\ell_{\infty,1}$ regularization as a constrained
optimization problem (in which the
$\ell_{\infty,1}$ norm is bounded by $\mu$) and
provide a solution in $O(n \log n)$ time. The
method shown here is simpler and faster because it
can work on each row separately.


Algorithm 1: Linear-time algorithm for the
proximal operator of the $\ell_\infty$ norm.

 1  procedure UPDATE(w, δ)
 2      lo, hi ← 1, n
 3      s ← 0
 4      while lo ≤ hi do
 5          select md randomly from lo, ..., hi
 6          ρ ← PARTITION(w, lo, md, hi)
 7          ξ ← (1/ρ) (s + Σ_{i=lo}^{ρ} |x_i| − δ)
 8          if ξ ≤ |x_ρ| then
 9              s ← s + Σ_{i=lo}^{ρ} |x_i|
10              lo ← ρ + 1
11          else
12              hi ← ρ − 1
13      ρ ← hi
14      ξ ← (1/ρ) (s − δ)
15      for i ← 1, ..., n do
16          x_i ← min(max(x_i, −ξ), ξ)
17  procedure PARTITION(w, lo, md, hi)
18      swap x_lo and x_md
19      i ← lo + 1
20      for j ← lo + 1, ..., hi do
21          if |x_j| ≥ |x_lo| then
22              swap x_i and x_j
23              i ← i + 1
24      swap x_lo and x_{i−1}
25      return i − 1
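A Python rendering of the same idea may help when experimenting. The sketch below is not the NPLM implementation; it is a reading of Algorithm 1 under a few stated assumptions: indices are 0-based, the partitioning works on a copy of the magnitudes so the input order is preserved, `delta` is nonnegative, and an explicit guard returns all zeros when the magnitudes sum to at most `delta`. The names `partition` and `prox_linf` are hypothetical.

```python
import math
import random

def partition(a, lo, md, hi):
    """Place a[md] at its rank in nonincreasing order within a[lo..hi]."""
    a[lo], a[md] = a[md], a[lo]
    i = lo + 1
    for j in range(lo + 1, hi + 1):
        if a[j] >= a[lo]:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[lo], a[i - 1] = a[i - 1], a[lo]
    return i - 1

def prox_linf(v, delta, rng=random):
    """Cap magnitudes of v so the total decrease is delta (expected linear time)."""
    a = [abs(x) for x in v]              # work on a copy of the magnitudes
    if sum(a) <= delta:
        return [0.0] * len(v)
    lo, hi = 0, len(a) - 1
    s = 0.0                              # sum of a[0..lo-1], all above the cap
    while lo <= hi:
        md = rng.randint(lo, hi)
        p = partition(a, lo, md, hi)
        block = sum(a[lo:p + 1])
        xi = (s + block - delta) / (p + 1)
        if xi <= a[p]:                   # the p + 1 largest can be lowered to xi
            s += block
            lo = p + 1
        else:
            hi = p - 1
    rho = hi + 1                         # number of capped components
    xi = (s - delta) / rho
    return [math.copysign(min(abs(x), xi), x) for x in v]

w = prox_linf([4.0, 1.0, -3.0], 2.0)
print(w)  # [2.5, 1.0, -2.5]
```

The random pivot changes only the order in which candidates are tested, not the result. Each iteration touches only the current search range, which is what gives the expected linear running time.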


5 Experiments 

We evaluate our model using the open-source
NPLM toolkit released by Vaswani et al. (2013),
extending it to use the additional regularizers as
described in this paper.² We use a vocabulary size
of 100k and word embeddings with 50 dimensions.
We use two hidden layers of rectified linear
units (Nair and Hinton, 2010).

² These extensions have been contributed to the
NPLM project.


We train neural language models (LMs) on two
natural language corpora, Europarl v7 English and
the AFP portion of English Gigaword 5. After
tokenization, Europarl has 56M tokens and Gigaword
AFP has 870M tokens. For both corpora, we hold
out a validation set of 5,000 tokens. We train each
model for 10 iterations over the training data.

Our experiments break down into three parts.
First, we look at the impact of our pruning method
on perplexity of a held-out validation set, across
a variety of settings. Second, we take a closer
look at how the model evolves through the training
process. Finally, we explore the downstream impact
of our method on a statistical phrase-based
machine translation system.

5.1 Evaluating perplexity and network size 

We first look at the impact that the
$\ell_{\infty,1}$ regularizer has on the
perplexity of our validation set. The main results
are shown in Table 1. For $\lambda \le 0.01$, the
regularizer seems to have little impact: no hidden
units are pruned, and perplexity is also not
affected. For $\lambda = 1$, on the other hand,
most hidden units are pruned, apparently too many,
since perplexity is worse. But for
$\lambda = 0.1$, we see that we are able to prune
out many hidden units: up to half of the first
layer, with little impact on perplexity. We found
this to be consistent across all our experiments,
varying n-gram size, initial hidden layer size,
and vocabulary size.

Table 2 shows the same information for 5-gram
models trained on the larger Gigaword AFP corpus.
These numbers look very similar to those on
Europarl: again $\lambda = 0.1$ works best, and,
counter to expectation, even the final number of
units is similar.

Table 3 shows the result of varying the
vocabulary size: again $\lambda = 0.1$ works best,
and, although it is not shown in the table, we
also found that the final number of units did not
depend strongly on the vocabulary size.

Table 4 shows results using the $\ell_{2,1}$ norm
(Europarl corpus, 5-grams, 100k vocabulary). Since
this is a different regularizer, there isn’t any
reason to expect that $\lambda$ behaves the same
way, and indeed, a smaller value of $\lambda$
seems to work best.

5.2 A closer look at training 

We also studied the evolution of the network over 
the training process to gain some insights into how 
the method works. The first question we want to 




                     2-gram                     3-gram                     5-gram
$\lambda$   layer 1  layer 2  ppl     layer 1  layer 2  ppl     layer 1  layer 2  ppl
0             1,000       50  103       1,000       50   66       1,000       50   55
0.001         1,000       50  104       1,000       50   66       1,000       50   54
0.01          1,000       50  104       1,000       50   63       1,000       50   55
0.1             499       47  105         652       49   66         784       50   55
1.0              50       24  111         128       32   76         144       29   68

Table 1: Comparison of $\ell_{\infty,1}$ regularization on 2-gram, 3-gram, and 5-gram neural language models. The
network initially started with 1,000 units in the first hidden layer and 50 in the second. A regularization
strength of $\lambda = 0.1$ consistently is able to prune units while maintaining perplexity, even though the final
number of units varies considerably across models. The vocabulary size is 100k.


$\lambda$   layer 1   layer 2   perplexity
0             1,000        50          100
0.001         1,000        50           99
0.01          1,000        50          101
0.1             742        50          107
1.0              24        17          173

Table 2: Results from training a 5-gram neural LM
on the AFP portion of the Gigaword dataset. As
with the smaller Europarl corpus (Table 1), a
regularization strength of $\lambda = 0.1$ is able
to prune units while maintaining perplexity.


            perplexity, by vocabulary size
$\lambda$    10k    25k    50k    100k
0             47     60     54      55
0.001         47     54     54      54
0.01          47     58     55      55
0.1           48     62     55      55
1.0           61     64     65      68

Table 3: A regularization strength of
$\lambda = 0.1$ is best across different
vocabulary sizes.


$\lambda$   layer 1   layer 2   perplexity
0             1,000        50          100
0.0001        1,000        50           54
0.001         1,000        50           55
0.01            616        50           57
0.1             199        32           65

Table 4: Results using $\ell_{2,1}$ regularization.



Figure 5: Number of units in first hidden layer over
time, with various starting sizes ($\lambda = 0.1$). If we
start with too many units, we end up with the same 
number, although if we start with a smaller number 
of units, a few are still pruned away. 

answer is whether the method is simply removing
units, or converging on an optimal number of
units. Figure 5 suggests that it is a little of both; 
if we start with too many units (900 or 1000), the 
method converges to the same number regardless 
of how many extra units there were initially. But 
if we start with a smaller number of units, the 
method still prunes away about 50 units. 

Next, we look at the behavior over time of
different regularization strengths $\lambda$. We
found that not only does $\lambda = 1$ prune out
too many units, it does so at the very first
iteration (Figure 6, above), perhaps prematurely.
By contrast, the $\lambda = 0.1$ run prunes out
units gradually. By plotting these curves together
with perplexity (Figure 6, below), we can see that
the $\lambda = 0.1$ run is fitting the model and
pruning it at the same time, which seems
preferable to fitting without any pruning
($\lambda =$

















Figure 6: Above: Number of units in first hidden
layer over time, for various regularization
strengths $\lambda$. A regularization strength of
$\lambda \le 0.01$ does not zero out any rows,
while a strength of 1 zeros out rows right away.
Below: Perplexity over time. The runs with
$\lambda \le 0.1$ have very similar learning
curves, whereas $\lambda = 1$ is worse from the
beginning.



                        neural LM
$\lambda$    none    Europarl       Gigaword AFP
0 (none)     23.2    24.7 (+1.5)    25.2 (+2.0)
0.1                  24.6 (+1.4)    24.9 (+1.7)

Table 5: The improvements in translation accuracy
due to the neural LM (shown in parentheses) are
affected only slightly by $\ell_{\infty,1}$
regularization. For the Europarl LM, there is no
statistically significant difference, and for the
Gigaword AFP LM, a statistically significant but
small decrease of $-0.3$.

0.01) or pruning first and then fitting ($\lambda = 1$).

We can also visualize the weight matrix itself 
over time (Figure 7), for $\lambda = 0.1$. It is striking
that although this setting fits the model and prunes 
it at the same time, as argued above, by the first 
iteration it already seems to have decided roughly 
how many units it will eventually prune. 

5.3 Evaluating on machine translation 

We also looked at the impact of our method on
statistical machine translation systems. We used
the Moses toolkit (Koehn et al., 2007) to build a
phrase-based machine translation system with a
traditional 5-gram LM trained on the target side
of our bitext. We augmented this system with
neural LMs trained on the Europarl data and the
Gigaword AFP data. Based on the results from the
perplexity experiments, we looked at models built
both with a $\lambda = 0.1$ regularizer and
without regularization ($\lambda = 0$).

We built our system using the newscommentary
dataset v8. We tuned our model using newstest13
and evaluated using newstest14. After standard
cleaning and tokenization, there were 155k
parallel sentences in the newscommentary dataset,
and 3,000 sentences each for the tuning and test
sets.

Table 5 shows that the addition of a neural
LM helps substantially over the baseline, with
improvements of up to 2 BLEU. Using the Europarl
model, the BLEU scores obtained without and
with regularization were not significantly
different ($p > 0.05$), consistent with the
negligible perplexity difference between these
models. On the Gigaword AFP model, regularization
did decrease the BLEU score by 0.3, consistent
with the small perplexity increase of the
regularized model. The decrease is statistically
significant, but small compared with the overall
benefit of adding a neural LM.
Figure 7: Evolution of the first hidden layer weight matrix after 1, 5, and 10 iterations (with rows sorted
by $\ell_\infty$ norm). A nonlinear color scale is used to show small values more clearly. The four vertical blocks
correspond to the four context words. The light bar at the bottom is the rows that are close to zero, and 
the white bar is the rows that are exactly zero. 


6 Related Work 

Researchers have been exploring the use of
neural networks for language modeling for a long
time. Schmidhuber and Heil (1996) proposed a
character n-gram model using neural networks
which they used for text compression. Xu and
Rudnicky (2000) proposed a word-based
probability model using a softmax output layer
trained using cross-entropy, but only for bigrams.
Bengio et al. (2003) defined a probabilistic word
n-gram model and demonstrated improvements over
conventional smoothed language models. Mnih and
Teh (2012) sped up training of log-bilinear
language models through the use of
noise-contrastive estimation (NCE). Vaswani et al.
(2013) also used NCE to train the architecture of
Bengio et al. (2003), and were able to integrate a
large-vocabulary language model directly into a
machine translation decoder. Baltescu et al.
(2014) describe a similar model, with extensions
like a hierarchical softmax (based on Brown
clustering) and direct n-gram features.

Beyond feed-forward neural network language
models, researchers have explored using
more complicated neural network architectures.
RNNLM is an open-source implementation of a
language model using recurrent neural networks
(RNN) where connections between units can form
directed cycles (Mikolov et al., 2011).
Sundermeyer et al. (2015) use the long short-term
memory (LSTM) neural architecture to show a
perplexity improvement over the RNNLM toolkit.
In future work, we plan on exploring how our
method could improve these more complicated
neural models as well.


Automatically limiting the size of neural
networks is an old idea. The “Optimal Brain
Damage” (OBD) technique (LeCun et al., 1989)
computes a saliency based on the second derivative
of the objective function with respect to each
parameter. The parameters are then sorted by
saliency, and the lowest-saliency parameters are
pruned. The pruning process is separate from the
training process, whereas regularization performs
training and pruning simultaneously.
Regularization in neural
networks is also an old idea; for example,
Nowlan and Hinton (1992) mention both $\ell_2$
and $\ell_0$ regularization. Our method builds on
this idea by using a mixed norm to prune units,
rather than parameters.

Srivastava et al. introduce a method called
dropout in which units are directly deactivated at
random during training (Srivastava et al., 2014),
which induces sparsity in the hidden unit
activations. However, at the end of training, all
units are reactivated, as the goal of dropout is
to reduce overfitting, not to reduce network size.
Thus, dropout and our method seem to be
complementary.

7 Conclusion 

We have presented a method for auto-sizing a
neural network during training by removing units
using an $\ell_{\infty,1}$ regularizer. This
regularizer drives a unit’s input weights as a
group down to zero, allowing the unit to be
pruned. We can thus prune units out of our network
during training with minimal impact on held-out
perplexity or downstream performance of a machine
translation system.

Our results showed empirically that the choice
of a regularization coefficient of 0.1 was robust
to initial configuration parameters of initial
network size, vocabulary size, n-gram order, and
training corpus. Furthermore, imposing a single
regularizer on the objective function can tune all
of the hidden layers of a network with one
setting. This reduces the need to conduct
expensive, multi-dimensional grid searches in
order to determine optimal sizes.

We have demonstrated the power and efficacy
of this method on a feed-forward neural network
for language modeling through experiments on
perplexity and machine translation. However, this
method is general enough that it should be
applicable to other domains, both inside natural
language processing and outside. As neural models
become more pervasive in natural language
processing, the ability to auto-size networks for
fast experimentation and quick exploration will
become increasingly important.